Conservative Contextual Bandits: Beyond Linear Representations
Rohan Deb, Mohammad Ghavamzadeh, Arindam Banerjee
Conservative Contextual Bandits (CCBs) address safety in sequential decision making by requiring that an agent's policy, along with minimizing regret, also satisfies a safety constraint: its performance is not worse than that of a baseline policy (e.g., the policy the company has in production) by more than a factor of $(1+\alpha)$. Prior work developed UCB-style algorithms in the multi-armed [Wu et al., 2016] and contextual linear [Kazerouni et al., 2017] settings. However, in practice the cost of the arms is often a non-linear function of the context, and therefore existing UCB algorithms are ineffective in such settings. In this paper, we consider CCBs beyond the linear case and develop two algorithms, $\mathtt{C-SquareCB}$ and $\mathtt{C-FastCB}$, using Inverse Gap Weighting (IGW) based exploration and an online regression oracle. We show that the safety constraint is satisfied with high probability and that the regret of $\mathtt{C-SquareCB}$ is sub-linear in the horizon $T$, while the regret of $\mathtt{C-FastCB}$ is first-order, i.e., sub-linear in $L^*$, the cumulative loss of the optimal policy. Subsequently, we use a neural network for function approximation and online gradient descent as the regression oracle to provide $\tilde{O}(\sqrt{KT} + K/\alpha)$ and $\tilde{O}(\sqrt{KL^*} + K (1 + 1/\alpha))$ regret bounds, respectively. Finally, we demonstrate the efficacy of our algorithms on real-world data and show that they significantly outperform the existing baseline while maintaining the performance guarantee.
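The exploration step both algorithms build on is Inverse Gap Weighting over the regression oracle's predicted losses. Below is a minimal sketch of that step together with a simplified conservative fallback to the baseline arm; the `budget` bookkeeping and the names `igw_distribution` / `conservative_igw_step` are illustrative stand-ins, not the paper's exact procedure.

```python
import numpy as np

def igw_distribution(pred_losses, gamma):
    """Inverse Gap Weighting over K arms given predicted losses.

    p(a) = 1 / (K + gamma * (yhat(a) - yhat(best))) for a != best,
    with the remaining probability mass placed on the greedy arm.
    """
    pred_losses = np.asarray(pred_losses, dtype=float)
    K = len(pred_losses)
    best = int(np.argmin(pred_losses))
    gaps = pred_losses - pred_losses[best]
    p = 1.0 / (K + gamma * gaps)
    p[best] = 0.0
    p[best] = 1.0 - p.sum()
    return p

def conservative_igw_step(pred_losses, gamma, budget, baseline_arm):
    """One (simplified) conservative step: explore with IGW only while the
    remaining safety budget relative to the baseline policy is positive;
    otherwise fall back to the baseline arm."""
    if budget <= 0.0:
        p = np.zeros(len(pred_losses))
        p[baseline_arm] = 1.0
        return p
    return igw_distribution(pred_losses, gamma)
```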
Efficient Contextual Bandits with Uninformed Feedback Graphs
Mengxiao Zhang, Yuheng Zhang, Haipeng Luo, Paul Mineiro
Bandits with feedback graphs are powerful online learning models that interpolate between the full information and classic bandit problems, capturing many real-life applications. A recent work by Zhang et al. (2023) studies the contextual version of this problem and proposes an efficient and optimal algorithm via a reduction to online regression. However, their algorithm crucially relies on seeing the feedback graph before making each decision, while in many applications, the feedback graph is uninformed, meaning that it is either only revealed after the learner makes her decision or even never fully revealed at all. This work develops the first contextual algorithm for such uninformed settings, via an efficient reduction to online regression over both the losses and the graphs. Importantly, we show that it is critical to learn the graphs using log loss instead of squared loss to obtain favorable regret guarantees. We also demonstrate the empirical effectiveness of our algorithm on a bidding application using both synthetic and real-world data.
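The point that graphs should be learned with log loss rather than squared loss amounts to fitting the edge-observation probabilities with an online logistic (cross-entropy) objective. The class below is a minimal sketch of such an estimator under an assumed linear-logistic parameterization; the name `OnlineEdgeEstimator` and the per-edge weight layout are illustrative, not the paper's construction.

```python
import numpy as np

class OnlineEdgeEstimator:
    """Per-(played arm, observed arm) logistic model trained with log loss.

    Predicts the probability that playing arm i reveals the loss of arm j
    given the context; the estimates are later used to weight observations.
    """

    def __init__(self, n_arms, dim, lr=0.1):
        self.w = np.zeros((n_arms, n_arms, dim))
        self.lr = lr

    def predict(self, x):
        # G_hat[i, j] = estimated P(loss of j observed | arm i played, context x)
        logits = self.w @ x
        return 1.0 / (1.0 + np.exp(-logits))

    def update(self, x, played, observed_mask):
        # Online gradient step on the log loss (binary cross-entropy)
        # for the row of the arm that was actually played.
        p = self.predict(x)[played]
        grad = (p - observed_mask)[:, None] * x[None, :]
        self.w[played] -= self.lr * grad
```

Intuitively, log loss punishes underestimates of small observation probabilities far more severely than squared loss, which matters when those estimates end up in the denominator of importance weights.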
Practical Contextual Bandits with Feedback Graphs
Mengxiao Zhang, Yuheng Zhang, Olga Vrousgou, Haipeng Luo, Paul Mineiro
While contextual bandits have a mature theory, effectively leveraging different feedback patterns to accelerate learning remains unclear. Bandits with feedback graphs, which interpolate between the full-information and bandit regimes, provide a promising framework to mitigate the statistical complexity of learning. In this paper, we propose and analyze an approach to contextual bandits with feedback graphs based upon a reduction to regression. The resulting algorithms are computationally practical and achieve established minimax rates, thereby reducing the statistical complexity in real-world applications.
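At the level of the reduction, the statistical gain from graph feedback is that a single round supplies regression data for every arm the graph reveals, not just the played arm. A rough sketch of that update, assuming a hypothetical `oracle.update(context, arm, loss)` interface and a 0/1 adjacency matrix; the paper's graph-aware exploration distribution is not reproduced here.

```python
import numpy as np

def update_regression_oracle(oracle, context, played_arm, graph, observed_losses):
    """Feed the regression oracle every loss revealed by the feedback graph.

    graph[i, j] = 1 means playing arm i reveals the loss of arm j, so one
    round can generate up to K labeled examples instead of just one.
    """
    revealed = np.flatnonzero(graph[played_arm])
    for arm in revealed:
        oracle.update(context, arm, observed_losses[arm])
    return revealed
```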
Infinite Action Contextual Bandits with Reusable Data Exhaust
Mark Rucker, Yinglun Zhu, Paul Mineiro
Those who ignore history are doomed to repeat it. A modern variant of this truth arises in controlled experimentation platforms, where offline procedures are a critical complement to online tests, e.g., supporting counterfactual evaluation strategies (Agarwal et al., 2016), offline model selection (Li et al., 2015), and prioritization of scarce online experimental resources (Gomez-Uribe & Hunt, 2015). Consequently, the utility of a learning algorithm is not solely determined by its online performance, but also by the post-hoc utility of its data exhaust. The recent contribution of Zhu & Mineiro (2022) exemplifies this: an online contextual bandit algorithm for infinite action spaces with O(1) space and time complexity with respect to the action set. Unfortunately, this performance is achieved by sampling from a distribution which is not absolutely continuous with respect to the reference measure. Therefore, a variety of post-hoc evaluation procedures that rely on importance weighting cannot be applied, limiting adoption. In this paper, we describe an alternative approach to infinite action spaces which not only enjoys a similar smooth regret guarantee (and empirical performance), but also utilizes sampling distributions with well-defined importance weights. In exchange, we pay an increased computational cost.
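The fix described here is to sample from a distribution whose density is bounded below relative to the reference measure, so importance weights exist for post-hoc evaluation. A toy sketch on the action space [0, 1], with made-up parameters `width` and `eps`; it illustrates why logged densities make IPS-style estimates well defined, not the paper's actual sampler.

```python
import numpy as np

def smoothed_density(a, greedy, width=0.05, eps=0.1):
    """Density on [0, 1]: a uniform component of mass eps everywhere, plus
    mass (1 - eps) spread over a window of size width around the greedy
    action, so the density never vanishes and importance weights exist."""
    lo, hi = max(0.0, greedy - width / 2), min(1.0, greedy + width / 2)
    dens = eps  # uniform exploration component
    if lo <= a <= hi:
        dens += (1.0 - eps) / (hi - lo)
    return dens

def ips_estimate(logged, target_density):
    """Importance-weighted (IPS) off-policy value estimate from logged
    (action, reward, logging_density) triples."""
    return float(np.mean([r * target_density(a) / p for a, r, p in logged]))
```

The lower bound `eps` on the logging density is exactly what keeps the ratio `target_density(a) / p` bounded, which is what the point-mass sampler of the earlier approach lacks.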